{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Classification with Structured/Tabular Data and Noisy Labels\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Consider Using Datalab\n",
    "<br/>\n",
    "\n",
    "If interested in detecting a wide variety of issues in your tabular data, check out the [Datalab tabular tutorial](https://docs.cleanlab.ai/stable/tutorials/datalab/tabular.html). Datalab can detect many other types of data issues beyond label issues, whereas CleanLearning is a convenience method to handle noisy labels with sklearn-compatible classification models.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this 5-minute quickstart tutorial, we use cleanlab with scikit-learn models to find potential label errors in a classification dataset with tabular features (numeric/categorical columns). Tabular (or *structured*) data are typically organized in a row/column format and stored in a SQL database or file types like: CSV, Excel, or Parquet. Here we consider a Student Grades dataset, which contains over 900 individuals who have three exam grades and some optional notes, each being assigned a letter grade (their class label). cleanlab automatically identifies _hundreds_ of examples in this dataset that were mislabeled with the incorrect final grade (data entry mistakes). \n",
    "\n",
    "This tutorial shows how to handle noisy labels and produce more robust classification models for your own tabular datasets. cleanlab's `CleanLearning` class automatically detects and filters out such badly labeled data, in order to train a more robust version of any Machine Learning model. No change to your existing modeling code is required! \n",
    "\n",
    "\n",
    "**Overview of what we'll do in this tutorial:**\n",
    "\n",
    "- Train a classifier model (here scikit-learn's ExtraTreesClassifier, although any model could be used) and use this classifier to compute (out-of-sample) predicted class probabilities via cross-validation.\n",
    "\n",
    "- Identify potential label errors in the data with cleanlab's `find_label_issues` method.\n",
    "\n",
    "- Train a robust version of the same ExtraTrees model via cleanlab's `CleanLearning` wrapper.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Quickstart\n",
    "<br/>\n",
    "    \n",
    "Already have an sklearn compatible `model`, tabular `data` and given `labels`? Run the code below to train your `model` and get label issues.\n",
    "\n",
    "\n",
    "<div  class=markdown markdown=\"1\" style=\"background:white;margin:16px\">  \n",
    "    \n",
    "```python\n",
    "\n",
    "from cleanlab.classification import CleanLearning\n",
    "\n",
    "cl = CleanLearning(model)\n",
    "_ = cl.fit(train_data, labels)\n",
    "label_issues = cl.get_label_issues()\n",
    "preds = cl.predict(test_data) # predictions from a version of your model \n",
    "                              # trained on auto-cleaned data\n",
    "\n",
    "\n",
    "```\n",
    "    \n",
    "</div>\n",
    "    \n",
    "Is your model/data not compatible with `CleanLearning`? You can instead run cross-validation on your model to get out-of-sample `pred_probs`. Then run the code below to get label issue indices ranked by their inferred severity.\n",
    "\n",
    "\n",
    "<div  class=markdown markdown=\"1\" style=\"background:white;margin:16px\">  \n",
    "    \n",
    "```python\n",
    "\n",
    "from cleanlab.filter import find_label_issues\n",
    "\n",
    "ranked_label_issues = find_label_issues(\n",
    "    labels,\n",
    "    pred_probs,\n",
    "    return_indices_ranked_by=\"self_confidence\",\n",
    ")\n",
    "    \n",
    "\n",
    "```\n",
    "    \n",
    "</div>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Install required dependencies\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use `pip` to install all packages required for this tutorial as follows:\n",
    "\n",
    "```ipython3\n",
    "!pip install cleanlab\n",
    "# Make sure to install the version corresponding to this tutorial\n",
    "# E.g. if viewing master branch documentation:\n",
    "#     !pip install git+https://github.com/cleanlab/cleanlab.git\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "# Package installation (hidden on docs website).\n",
    "dependencies = [\"cleanlab\"]\n",
    "\n",
    "if \"google.colab\" in str(get_ipython()):  # Check if it's running in Google Colab\n",
    "    %pip install cleanlab  # for colab\n",
    "    cmd = ' '.join([dep for dep in dependencies if dep != \"cleanlab\"])\n",
    "    %pip install $cmd\n",
    "else:\n",
    "    dependencies_test = [dependency.split('>')[0] if '>' in dependency \n",
    "                         else dependency.split('<')[0] if '<' in dependency \n",
    "                         else dependency.split('=')[0] for dependency in dependencies]\n",
    "    missing_dependencies = []\n",
    "    for dependency in dependencies_test:\n",
    "        try:\n",
    "            __import__(dependency)\n",
    "        except ImportError:\n",
    "            missing_dependencies.append(dependency)\n",
    "\n",
    "    if len(missing_dependencies) > 0:\n",
    "        print(\"Missing required dependencies:\")\n",
    "        print(*missing_dependencies, sep=\", \")\n",
    "        print(\"\\nPlease install them before running the rest of this notebook.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "import numpy as np\n",
    "import pandas as pd \n",
    "from sklearn.preprocessing import StandardScaler, LabelEncoder\n",
    "from sklearn.model_selection import cross_val_predict, train_test_split\n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.ensemble import ExtraTreesClassifier\n",
    "\n",
    "from cleanlab.filter import find_label_issues\n",
    "from cleanlab.classification import CleanLearning\n",
    "\n",
    "SEED = 100 \n",
    "\n",
    "np.random.seed(SEED)\n",
    "random.seed(SEED)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Load and process the data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first load the data features and labels (which are possibly noisy).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grades_data = pd.read_csv(\"https://s.cleanlab.ai/grades-tabular-demo-v2.csv\")\n",
    "grades_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_raw = grades_data[[\"exam_1\", \"exam_2\", \"exam_3\", \"notes\"]]\n",
    "labels_raw = grades_data[\"letter_grade\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we preprocess the data. Here we apply one-hot encoding to features with categorical data, and standardize features with numeric data. We also perform label encoding on the labels, as cleanlab's functions require the labels for each example to be an interger integer in 0, 1, …, num_classes - 1. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "categorical_features = [\"notes\"]\n",
    "X_encoded = pd.get_dummies(X_raw, columns=categorical_features, drop_first=True)\n",
    "\n",
    "numeric_features = [\"exam_1\", \"exam_2\", \"exam_3\"]\n",
    "scaler = StandardScaler()\n",
    "X_processed = X_encoded.copy()\n",
    "X_processed[numeric_features] = scaler.fit_transform(X_encoded[numeric_features])\n",
    "\n",
    "encoder = LabelEncoder()\n",
    "encoder.fit(labels_raw)\n",
    "labels = encoder.transform(labels_raw)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Bringing Your Own Data (BYOD)?\n",
    "\n",
    "You can easily replace the above with your own tabular dataset, and continue with the rest of the tutorial.\n",
    " \n",
    "Your classes (and entries of `labels`) should be represented as integer indices 0, 1, ..., num_classes - 1.   \n",
    "For example, if your dataset has 7 examples from 3 classes, `labels` might look like: `np.array([2,0,0,1,2,0,1])`\n",
    "\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Select a classification model and compute out-of-sample predicted probabilities\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we use a simple ExtraTrees classifier that fits various randomized decision tress on our data, but you can choose any suitable scikit-learn model for this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clf = ExtraTreesClassifier()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To find potential labeling errors, cleanlab requires a probabilistic prediction from your model for every datapoint. However, these predictions will be _overfitted_ (and thus unreliable) for examples the model was previously trained on. For the best results, cleanlab should be applied with **out-of-sample** predicted class probabilities, i.e., on examples held out from the model during the training.\n",
    "\n",
    "K-fold cross-validation is a straightforward way to produce out-of-sample predicted probabilities for every datapoint in the dataset by training K copies of our model on different data subsets and using each copy to predict on the subset of data it did not see during training. An additional benefit of cross-validation is that it provides a more reliable evaluation of our model than a single training/validation split. We can implement this via the `cross_val_predict` method from scikit-learn:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_crossval_folds = 5  \n",
    "pred_probs = cross_val_predict(\n",
    "    clf,\n",
    "    X_processed,\n",
    "    labels,\n",
    "    cv=num_crossval_folds,\n",
    "    method=\"predict_proba\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Use cleanlab to find label issues\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Based on the given labels and out-of-sample predicted probabilities, cleanlab can quickly help us identify poorly labeled instances in our data table. For a dataset with N examples from K classes, the labels should be a 1D array of length N and predicted probabilities should be a 2D (N x K) array. Here we request that the indices of the identified label issues be sorted by cleanlab's self-confidence score, which measures the quality of each given label via the probability assigned to it in our model's prediction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ranked_label_issues = find_label_issues(\n",
    "    labels=labels, pred_probs=pred_probs, return_indices_ranked_by=\"self_confidence\"\n",
    ")\n",
    "\n",
    "print(f\"Cleanlab found {len(ranked_label_issues)} potential label errors.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's review some of the most likely label errors:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_raw.iloc[ranked_label_issues].assign(label=labels_raw.iloc[ranked_label_issues]).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These final grades look suspicious and should definitely be carefully re-examined! This is a straightforward approach to visualize the rows in a data table that might be mislabeled."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Train a more robust model from noisy labels\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Following proper ML practice, let's split our data into train and test sets.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, labels_train, labels_test = train_test_split(\n",
    "    X_encoded,\n",
    "    labels,\n",
    "    test_size=0.2,\n",
    "    random_state=SEED,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We again standardize the numeric features, this time fitting the scaling parameters solely on the training set.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scaler = StandardScaler()\n",
    "X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])\n",
    "X_test[numeric_features] = scaler.transform(X_test[numeric_features])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now train and evaluate the original ExtraTrees model.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clf.fit(X_train, labels_train)\n",
    "acc_og = clf.score(X_test, labels_test)\n",
    "print(f\"Test accuracy of original model: {acc_og}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "cleanlab provides a wrapper class that can be easily applied to any scikit-learn compatible model. Once wrapped, the resulting model can still be used in the exact same manner, but it will now train more robustly if the data have noisy labels.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clf = ExtraTreesClassifier()  # Note we first re-initialize clf\n",
    "cl = CleanLearning(clf)  # cl has same methods/attributes as clf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following operations take place when we train the cleanlab-wrapped model: The original model is trained in a cross-validated fashion to produce out-of-sample predicted probabilities. Then, these predicted probabilities are used to identify label issues, which are then removed from the dataset. Finally, the original model is trained on the remaining clean subset of the data once more.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = cl.fit(X_train, labels_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can get predictions from the resulting model and evaluate them, just like how we did it for the original scikit-learn model.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preds = cl.predict(X_test)\n",
    "acc_cl = accuracy_score(labels_test, preds)\n",
    "print(f\"Test accuracy of cleanlab-trained model: {acc_cl}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the test set accuracy slightly improved as a result of the data cleaning. Note that this will not always be the case, especially when we evaluate on test data that are themselves noisy. The best practice is to run cleanlab to identify potential label issues and then manually review them, before blindly trusting any accuracy metrics. In particular, the most effort should be made to ensure high-quality test data, which is supposed to reflect the expected performance of our model during deployment."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Spending too much time on data quality?\n",
    "\n",
    "Using this open-source package effectively can require significant ML expertise and experimentation, plus handling detected data issues can be cumbersome.\n",
    "\n",
    "That’s why we built [Cleanlab Studio](https://cleanlab.ai/blog/data-centric-ai/) -- an automated platform to find **and fix** issues in your dataset, 100x faster and more accurately.  Cleanlab Studio automatically runs optimized data quality algorithms from this package on top of cutting-edge AutoML & Foundation models fit to your data, and helps you fix detected issues via a smart data correction interface. [Try it](https://cleanlab.ai/) for free!\n",
    "\n",
    "<p align=\"center\">\n",
    "  <img src=\"https://raw.githubusercontent.com/cleanlab/assets/master/cleanlab/ml-with-cleanlab-studio.png\" alt=\"The modern AI pipeline automated with Cleanlab Studio\">\n",
    "</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "# Note: This cell is only for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.\n",
    "\n",
    "if acc_og >= acc_cl:  # check cleanlab has improved prediction accuracy\n",
    "    raise Exception(\"Cleanlab training failed to improve model accuracy.\")\n",
    "    \n",
    "# this file contains true and noisy labels\n",
    "true_data = pd.read_csv(\"https://s.cleanlab.ai/student-grades-demo.csv\")\n",
    "true_errors = np.where(true_data[\"letter_grade\"] != true_data[\"noisy_letter_grade\"])[0]\n",
    "if not all(x in true_errors for x in ranked_label_issues[:5]):  # check top errors are indeed errors\n",
    "    raise Exception(\"Some of the top listed errors are not actually label errors.\")"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "cda20062bc42cfdcaa0f9720c0b28e880bba110e9dfce6c1689934eec9b595a1"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
